I’ve pulled the provided data for the no-prize competiton, “Pump it Up: Data Mining the Water Table” from Driven Data, supported by Taarifa and the Tanzanian Ministry of Water and performed exploratory data analysis. I will be storing this analysis to soon use in a machine learning exercise and participate in the competition, the goal of which is to predict when a well will fail given all information but its operational status.
I’m thrilled to see data science used for humanitarian ends and I hope this is the first of many opportunities I have to engage with this kind of work.
Taarifa describes itself as “an open source platform for the crowd sourced reporting and triaging of infrastructure related ussues”. To this end, they’ve compiled and provided a dataset describing the condition of water wells in Tanzania in collaboration with the Tanzanian Ministry of Water. Users of the well are provided with a means of reporting the status of a well (functional, functional needs repair, or non functional) which is associated with an ID containing some 40 descriptors including the ID itself.
In typical data competition style, the goal of the competition is to take this given dataset (the train set) and create a machine learning algorithm that will perform as well as possible against the test set. In this document, I do not employ any machine learning, I am focused exclusively on a thorough exploratory analysis both to hone my R and data visualization abilities as well as preparing as thorough a possible report of the data to inform any machine learning I may later use.
With that I’ll initialize some libraries I use and load in the data. Note that the status of the well comes in a separate CSV file, so I’ve loaded that one in and added its values to a new column in the main data file.
Before beginning any analysis I’ve renamed several of the variables to follow standard R format and removed several factors that were either redundant or irrelevant. I’ve also created a handful of subsets that will come in handy throughout.
Let’s make sure we have the same number of entries in the outcomes and data files, and that the IDs of each match up
## [1] TRUE
Fantastic, this evaluates to true. We know at least this part of our data is clean.
So let’s poke around our data. We know we have 59400 different entries, and 22 variables including the status.
## 'data.frame': 59400 obs. of 22 variables:
## $ id : int 69572 8776 34310 67743 19728 9944 19816 54551 53934 46144 ...
## $ amount.tsh : num 6000 0 25 0 0 20 0 0 0 0 ...
## $ date.recorded : Date, format: "2011-03-14" "2013-03-06" ...
## $ height : int 1390 1399 686 263 0 0 0 0 0 0 ...
## $ longitude : num 34.9 34.7 37.5 38.5 31.1 ...
## $ latitude : num -9.86 -2.15 -3.82 -11.16 -1.83 ...
## $ basin : Factor w/ 9 levels "Internal","Lake Nyasa",..: 2 5 6 8 5 6 1 4 4 5 ...
## $ region : Factor w/ 21 levels "Arusha","Dar es Salaam",..: 4 10 9 13 5 21 18 18 20 5 ...
## $ population : int 109 280 250 58 0 1 0 0 0 0 ...
## $ pubmeet : Factor w/ 3 levels "","False","True": 3 1 3 3 3 3 3 3 3 3 ...
## $ scheme.mgmt : Factor w/ 13 levels "","Company","None",..: 9 4 9 9 1 9 9 1 9 1 ...
## $ permit : Factor w/ 3 levels "","False","True": 2 3 3 3 3 3 3 3 3 3 ...
## $ constr.year : int 1999 2010 2009 1986 0 2009 0 0 0 0 ...
## $ extr.type.class: Factor w/ 7 levels "gravity","handpump",..: 1 1 1 6 1 6 2 2 2 2 ...
## $ mgmt : Factor w/ 12 levels "company","other",..: 8 12 8 8 2 8 8 12 8 8 ...
## $ mgmt.group : Factor w/ 5 levels "commercial","other",..: 5 5 5 5 2 5 5 5 5 5 ...
## $ payment.type : Factor w/ 7 levels "annually","monthly",..: 1 3 6 3 3 6 3 7 3 3 ...
## $ water.qual : Factor w/ 8 levels "coloured","fluoride",..: 7 7 7 7 7 5 7 4 5 7 ...
## $ quant : Factor w/ 5 levels "dry","enough",..: 2 3 2 1 4 2 2 2 4 2 ...
## $ source : Factor w/ 10 levels "dam","hand dtw",..: 9 6 1 4 6 5 4 8 4 8 ...
## $ wpt.type : Factor w/ 7 levels "cattle trough",..: 2 2 3 3 2 3 5 5 5 5 ...
## $ outcomes : Factor w/ 3 levels "func","repair",..: 1 1 1 3 1 1 3 3 3 1 ...
Let’s see what the well outcome distribution looks like so we know what we’re normalizing against.
Now that we have a good picture of what is included and relevant in our data, let’s do some analysis.
A good first step would be to use the ggmaps package to look at Tanzania overlaid with a density spread of each level of the outcomes factor.
Interestingly, all of the functioning wells are around the border of the country, excluding the western border. Everything in the interior and to the west is in some state of disrepair.
What about height? The unit of height is not given in the data set but I assume it is meters.
It looks like height might also play into the equation. Above 1500 meters or so it appears that wells are more functional than not. At ~800 meters we see that non functioning wells dominate the peak and at ~1200 we see that wells in need of repair dominate the peak.
Let’s take a look at the distribution of well status by region and basin. This is the first time we encounter discrete variables, which happen to comprise the majority of our data set. This prevents us from using ggpairs to quickly find correlation between all of our factors and the outcomes, but it does give an opportunity to write two functions to effectively do the same thing.
The first function works to produce a stacked bar plot with a column for each level of the input factor. I call it catPlots for Categorical Plots
The second will output the Cramer’s V of the input variable considered in all three binary combinations of the outcome levels. I’ve chosen to perform the test on all three permutations rather than just one test across the main dataset because eventually I want to apply the correlation to distinguish between all three groups and will most likely break the classifier into first determining if the well is functioning or not and only subsequent to a negative result will I distinguish between needs repair and non functional. I’ve also chosen to only output the actual value for Cramer’s V as in each case the Chi-Sq value is astronomically large and we just want to know how strong the significance is.
The rule of thumb I have learned for Cramer’s V is to consider a value of .1-.3 moderately significant and a value above .3 strongly significant, so these are the metrics by which I will compare the results we get below.
Then onto looking at the region and basin.
Both of these seem to carry weight in determining the status of the well. It is difficult to get truly in depth about this sort of thing without having a more thorough understanding of Tanzanian politics and demographics. To put this information to better use I would want to look at population by region, associate the basins with each region, and understand the distance of each well from the basin. I’ll leave all that out of the present study as I think I could easily overextend that analysis and I’d rather look at the more readily available data before getting bogged down in associating outside data with the given data.
But before we go, let’s take a peak at the Cramer’s V for each.
## [1] "Functional and Needs Repair"
## [[1]]
## [1] 0.2198142
##
## [1] "Non Functional and Needs Repair"
## [[1]]
## [1] 0.248826
##
## [1] "Functional and Non Functional"
## [[1]]
## [1] 0.2299771
This falls into our moderately significant category.
## [1] "Functional and Needs Repair"
## [[1]]
## [1] 0.143766
##
## [1] "Non Functional and Needs Repair"
## [[1]]
## [1] 0.1062122
##
## [1] "Functional and Non Functional"
## [[1]]
## [1] 0.1599134
This does as well, but less so than region. I’ll use both for the classifier.
That wraps up geography, we’ll take a look at demographics next.
We’ll make a frequency plot to look at the relationship between outcomes and population. I’ve created a new dataset, popsub, to exclude the null values and the handful of massive outliers that skewed the datatset.
It strikes me that around the 200 population mark the func and non func trends suddenly become much closer to each other, indicating that as population increases, the wells are much more likely to be non functional. This is no surprise. Let’s see what else we can find.
Well, the means and medians of the functional and nonfunctional groups are right on top of each other. But the functional but needs repair group has a significantly higher median and mean. Looking back up at the graph this makes a lot of sense, the spikes in the lower regions of the repair plot are much more comparable in height to the higher population spikes than in the other two groups. Maybe there is some threshold of maintenance that a local community can support to avoid complete failure but still fall short of performance standards that might require a larger industrial organization to maintain.
The type of water pumped through the well probably has a big effect on its condition. Let’s look at a couple of the different factors that could affect it. We have variables to look at: Extraction Type Class, Water Quality, Quantity, Waterpoint Type, and Amount value for the waterhead (the only numerical variable). Let’s breeze through the discrete variables and look at their bar plots and Cramer’s V values
## [1] "Functional and Needs Repair"
## [[1]]
## [1] 0.09674588
##
## [1] "Non Functional and Needs Repair"
## [[1]]
## [1] 0.2341815
##
## [1] "Functional and Non Functional"
## [[1]]
## [1] 0.3325577
These Cramer’s V values are quite useful, especially considering my plan to sort by functional wells first then by wells in disrepair. Motor pumped wells fail most frequently, supporting the “keep it simple” maxim.
## [1] "Functional and Needs Repair"
## [[1]]
## [1] 0.06093606
##
## [1] "Non Functional and Needs Repair"
## [[1]]
## [1] 0.1404619
##
## [1] "Functional and Non Functional"
## [[1]]
## [1] 0.1854727
These are only slightly correlated, but we’ll keep them nonetheless.
## [1] "Functional and Needs Repair"
## [[1]]
## [1] 0.08188798
##
## [1] "Non Functional and Needs Repair"
## [[1]]
## [1] 0.2375964
##
## [1] "Functional and Non Functional"
## [[1]]
## [1] 0.4290927
Just like in extraction type, we get some very useful Cramer’s Vs here. Obviously if a well is dry it’s not going to be very useful. This could skew the results, we might need to look at this more carefully when we put it in practice.
## [1] "Functional and Needs Repair"
## [[1]]
## [1] 0.1337592
##
## [1] "Non Functional and Needs Repair"
## [[1]]
## [1] 0.16742
##
## [1] "Functional and Non Functional"
## [[1]]
## [1] 0.3584663
Another useful set of values. I don’t know enough about water pipes to shed light onto this, but as usual the unknown values skew the data quite a bit.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It seems likely that the date a well was built is going to influence its status. Let’s check it out with a frequency plot.
Ok, but we know we have way more functional wells than not anyway, so let’s normalize this.
It definitely looks like there’s a strong correlation here, the earlier years are dominated by non functioning wells while more recently constructed wells are in better shape. A box plot and some statistics might be more informative.
## constrsub$outcomes: func
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1960 1995 2003 2000 2009 2013
## --------------------------------------------------------
## constrsub$outcomes: repair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1960 1985 1998 1995 2006 2013
## --------------------------------------------------------
## constrsub$outcomes: nonfunc
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1960 1981 1994 1992 2004 2013
This is a rich source of info for us. Clearly the year a well was made is going to have a huge influence on whether or not it is still functioning. How rich?
## [1] "Functional and Needs Repair"
## [[1]]
## [1] 0.1440896
##
## [1] "Non Functional and Needs Repair"
## [[1]]
## [1] 0.1072128
##
## [1] "Functional and Non Functional"
## [[1]]
## [1] 0.260058
## X^2 df P(> X^2)
## Likelihood Ratio 3999.5 106 0
## Pearson 3950.8 106 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.304
## Cramer's V : 0.226
## X^2 df P(> X^2)
## Likelihood Ratio 622.11 53 0
## Pearson 701.69 53 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.168
## Cramer's V : 0.17
## X^2 df P(> X^2)
## Likelihood Ratio 309.12 53 0
## Pearson 291.85 53 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.13
## Cramer's V : 0.131
## X^2 df P(> X^2)
## Likelihood Ratio 3726.8 53 0
## Pearson 3674.6 53 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.304
## Cramer's V : 0.319
Excellent, we can see that clearly the differences are statistically significant, but the only correlation that is close to strong enough to be worth considering is that between functional and non functional wells. With a Cramer’s V of .319, we would do well to remember this when we try to predict the test dataset’s results.
What about the dates that the report was sent in? There are no reports before 2011 so I’ve set the axis to start there.
This actually gives us some interesting metadata about the dataset. Its likely that the program used to report the information has become more popular in recent years, we can see evidence of that from the series of smaller spikes after 2013. It looks like these spikes followed a major push in early 2013 to collect data en masse. It would be informative to check for redundant entries and see if any of these smaller reports are from locals who hopped on board after the push.
What if we associate the recorded date with the year the well was constructed?
The remaining variables don’t lend themselves to being treated as part of any particular group, so we’ll just blow through them and look at a bar plot + Cramer’s V for each of them. I’m including missing values as it will be important to have a way to classify missing entries in our test set.
Is the status of the well as a public meeting hub relevant?
## [1] "Functional and Needs Repair"
## [[1]]
## [1] 0.0485091
##
## [1] "Non Functional and Needs Repair"
## [[1]]
## [1] 0.04616502
##
## [1] "Functional and Non Functional"
## [[1]]
## [1] 0.07824534
Very low Cramer’s V, not relevant.
Is the manager relevant?
## [1] "Functional and Needs Repair"
## [[1]]
## [1] 0.126361
##
## [1] "Non Functional and Needs Repair"
## [[1]]
## [1] NaN
##
## [1] "Functional and Non Functional"
## [[1]]
## [1] 0.1507797
Very low Cramer’s V, marginally relevant.
Does the well’s permit status affect the outcome?
## [1] "Functional and Needs Repair"
## [[1]]
## [1] 0.03357306
##
## [1] "Non Functional and Needs Repair"
## [[1]]
## [1] 0.03666399
##
## [1] "Functional and Non Functional"
## [[1]]
## [1] 0.03482529
Not at all.
Is management group relevant?
## [1] "Functional and Needs Repair"
## [[1]]
## [1] 0.06311428
##
## [1] "Non Functional and Needs Repair"
## [[1]]
## [1] 0.0774946
##
## [1] "Functional and Non Functional"
## [[1]]
## [1] 0.048672
Not at all.
Is payment type relevant?
## [1] "Functional and Needs Repair"
## [[1]]
## [1] 0.1012351
##
## [1] "Non Functional and Needs Repair"
## [[1]]
## [1] 0.1873774
##
## [1] "Functional and Non Functional"
## [[1]]
## [1] 0.2559655
We finally get a bit of useful information out of this with significant Cramer’s V for NotFunction-Repair and Function-NotFunction groups.
Many of the plots were similar due to the categorical nature of the data, and it is difficult to point to any particular ones as being more useful than the others. Perhaps the most informative was the first, the density plot of wells by status over the map of Tanzania, which I’ve reproduced below. We see that there are no functional wells whatsoever in the interior or the south west border. Wells to the west that are not at 100% tend to be in disrepair but still operational, while wells in complete disrepair are spread relatively uniformly, with hotspots generally matching the functional wells.
The year the well was constructed is clearly fundamental to determining its status, clearly visualized below. Wells constructed before 2000 tend to be in disrepair, and likely completely disfunctional, but wells made after 2000 tend to be functional and if not, they are more likely to be functional but need repair.
And finally the bar chart demonstrating well status dependence on region. I think this one could be the most interesting to me as it is the most ripe for doing some really exploratory analysis with the demographics of Tanzania if I can get my hands on a good dataset of demographic info. We saw earlier the heavy numerical dependence of status on region, if we could take this information and associate it with information on overall wealth, population,infrastructure, political condition, etc. we might be able to pull out some truly interesting and informative details!
This is a rich dataset and there are a lot of influential variables that can help us determine the status of a well. The grand majority of variables were categorical which I think is both a blessing and a curse. On one hand, this lets us give clear and discrete weights to each level of the categorical factors which I think will be useful in designing classifiers. On the other hand, the data was more difficult to work with and create informative plots as numerical data lends itself to more intuitive visualization. Importantly, one of the few numerical variables, position, was very informative and showed us a very clear distribution of well status depending on its position relative to the borders of Tanzania.
I normalized almost all the data to compensate for the different numbers of reported well outcomes, but it could be useful to have both normalized and non-normalized data on hand as certainly well predictions will be influenced by the prevalence of functional wells just as much as any other variable. Although perhaps just taking into account the distribution of the wells in the first place would work in a multinomial regression if all other variables stayed normalized. If I continue, which I do hope to, I would like to pull demographic data for Tanzania and do a sort of meta-normalization of population, politics, and water quality of each region as well as associating each region with a batch of coordinates.
The data came from:
https://www.drivendata.org/competitions/7/
Throughout this study I made extensive use of the following sources:
http://www.statmethods.net/stats/index.html
https://onlinecourses.science.psu.edu/stat200/node/71
http://datascience.stackexchange.com/questions/893/how-to-get-correlation-between-two-categorical-variable-and-a-categorical-variab
http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm
As well as the Udacity course on R.
This was good fun and I look forward to merging what I learned here with machine learning!